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Abstract 

Many Lattice QCD observables of phenomenological interest include so-called all-to-all prop- 
agators. The computation of these requires prohibitively large computational resources, unless 
they are estimated stochastically. This is usually done. However, the computational demand 
can often be further reduced by one order of magnitude by implementing sophisticated unbiased 
noise reduction techniques. We combine both well known and novel methods that can be ap- 
plied to a wide range of problems. We concentrate on calculating disconnected contributions to 
nucleon structure functions, as one realistic benchmark example. In particular we determine the 
strangeness contributions to the nucleon, (N\ss\N) and to the spin of the nucleon, As. 
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1. Introduction 

Advanced Lattice QCD calculations often require the evaluation of diagrams with discon- 
nected quark lines. Important examples are properties of flavour singlet mesons QJJ] , QCD spec- 
troscopy including multiquark and scattering states [2], the determination of hadronic scattering 
lengths J3l, and of isosinglet contributions to hadronic form factors and structure functions flEtl- 
Moreover, statistical errors of some standard observables like meson masses and electroweak de- 
cay constants can be reduced by averaging the sources over the lattice volume. In all these cases 
the standard point-to-all propagators that are obtained by calculating twelve (colour times spin) 
columns of the inverse lattice Dirac operator are not sufficient. Instead, all-to-all or timeslice-to- 
all propagators need to be computed. 

Lattices typically contain a number of sites ranging from V ~ 10 6 up to V ~ 10 9 . Inverting 
a lattice Dirac operator by conventional means would increase the effort in terms of computer 
memory and operations spent by this factor, relative to the cost of obtaining a single point-to-all 
propagator. An alternative approach to the problem consists of calculating an unbiased stochastic 
estimate of the propagator, replacing a factor 12V in effort by a number N of random sources, 
where in certain situations N can be as small as 10. Such estimation is permissible since, provided 
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it is unbiased, it will only add to the statistical errors. These are present in any case, due to the 
path integral evaluation of expectation values of observables by means of a Markov Monte Carlo 
simulation (for an introduction into Lattice QCD simulations see e.g. ((J). 

In general, the additional error introduced by this procedure will, for sufficiently large num- 
bers of estimates N, decrease in proportion to 1 / y/N. Ideally, N should be chosen such that this 
additional error is smaller than the statistical error induced by the Monte Carlo time series. The 
optimal number will depend on the observable in question and on the prescription employed to 
calculate the stochastic estimate. In this article we discuss different improvement methods aimed 
at decreasing the prefactor of the asymptotic 1 / y/N behaviour. In particular, we will introduce a 
new such method which we call the truncated solver method (TSM). We then test combinations 
of different improved stochastic estimator methods in a realistic Lattice QCD simulation. As 
our benchmark examples we choose to calculate the strangeness contribution to the spin of the 
nucleon As as well as the scalar strangeness content (N\ss\N). 

The spin of the nucleon can be factorized into a quark spin contribution AS, a quark angular 
momentum contribution L q and a gluonic contribution (spin and angular momentum) AG: 

^ = ^AS + L q + AG . (1) 

In the naive nonrelativistic SU(6) quark model, AS = 1, with vanishing angular momentum and 
gluon contributions. In this case sea quark contributions will be absent too and therefore there 
will be no strangeness contribution As in the factorization, 

AS = Ad + Au + As + ■ ■ ■ , (2) 



where in our notation Aq contains both, the spin of the quarks q and of the antiquarks q. Experi- 
mentally, As is usually obtained by integrating the strangeness contribution to the spin structure 
function g\ over the momentum fraction x. The integral over the range in which data exists 
(x > 0.004) usually agrees with zero. For instance a recent Hermes measurement in the region 
x > 0.02 yields [7] As = 0.037(19)(27). This means that non-zero results rely on extrapola- 
tions into the experimentally unprobed region of very small x and are model dependent [8, 9]. 
The standard Hermes analysis [10] yields As = -0.085(13)(8)(9) at a renormalization scale 
fi 2 = 5 GeV 2 in the MS scheme. Our results below suggest, As = -0.017(21), unless there are 
large effects in the chiral extrapolation from a pseudoscalar mass raps ~ 450 Me V to the physical 
point. 

The scalar strangeness density is not directly accessible in experiment but plays a role in 
models of nuclear structure. It is also of phenomenological interest since, assuming that heavy 
flavours are strongly suppressed, the dominant coupling of the Higgs particle to the nucleon will 
be accompanied by this scalar matrix element. 

The outline of this article is as follows: in Sec. |2]we introduce our notation and explain the 
stochastic estimator and noise reduction methods applied. In Sec.[3]we detail our lattice setup and 
employ these techniques to calculate disconnected quark loop contributions. Finally, in Sec.|4]we 
present our results on the axial and scalar nucleon strangeness matrix elements, before we con- 
clude. The systematic tuning of the parameters used in the truncated solver method is described 



in detail in Appendix A The TSM is quite generally applicable. This is also demonstrated 
in the appendix where the conjugate gradient (CG) solver is replaced by the popular stabilized 
biconjugate gradient (BiCGstab2) solver lUlll . Preliminary results appeared in [ 12] and fl3ll . 
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2. Noise reduction techniques 



As outlined above we require propagators M -1 connecting arbitrary pairs of lattice points, 
where M denotes a lattice Dirac operator. In our specific case we employ the Wilson quark 
action 1 1 4. 1611. Since the propagators have 12 Vx 12V components, where V denotes the number of 
lattice points, direct evaluation would be prohibitively expensive, both in terms of computer time 
and of memory. However, we encounter statistical errors anyway from the importance sampling 
of path integral expectation values. Hence it is sufficient to aim at an (unbiased) estimate, which 
can be obtained by stochastic methods 1U5I1 . For this purpose we introduce the following notation, 

J-7? :-L±A>. (3, 

N denotes the number of "stochastic estimates". Let I77'), i = 1, . . .,N be random vectors with 
the properties, 

fo>=0(l/V\), (4) 

Wn\ = 1 + 0(1/ y/N) ■ (5) 

These requirements are for instance met by complex Z2 noise where the 12V components are 
numbers with the uncorrected random phases <p £ {±^/4, ±37r/4). In llm Il7l Il8ll it has 
been demonstrated that real and complex Z2 noise reduces the variance, relative to other choices 
such as Gaussian or double hump noise. In our experience, similarly small variances can also be 
obtained with Z/y, U(l) or even with SU(3) noisy sources [ 19], indicating that the only important 
property is an equal modulus of the components. In the present study we employ complex Z2 
noise, where the components of the random vectors |t/) run over the spacetime volume, spin and 
colour. 

We define the Hermitian lattice Dirac operator Q = y^M. If we solve the linear systems, 

Q\j) = W), (6) 

for \s'), then we can substitute, see Eq. (O, 

QT l =kX^I+ QT l {t-mti) *E(Q-'):=mf\. (7) 

The difference between the approximation of Eq. (0 above and the exact result is unbiased and 
reduces like 1 / y/N. The sparse linear system of Eq. (0 can for example be solved by means of 
the conjugate gradient (CG) or the stabilized biconjugate gradient (BiCGstab2) algorithms. Note 
that in the CG case we can actually use even/odd preconditioning by employing the operator 
M^M = Q 2 . We then obtain \s') by multiplying the result with Q. 

The stochastic estimate approach reduces the problem from 0(12V x 12V) to 0(N x 12V), in 
terms of memory and computer time. The stochastic error will remain roughly constant if is 
scaled like y/Va 2 with the lattice volume, where a denotes the lattice spacing. In order to limit 
the computational effort, should not be chosen overly large. However, in the end the stochastic 
noise should not be the dominant source of statistical error. In general, the optimal balance be- 
tween the stochastic sampling on single configurations and the Monte Carlo importance sampling 
of gauge configurations will depend on the observable in question and on the methods used. 
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Due to the difference between |s)(77l and Q l above, any fermionic observable A can only be 
estimated up to a stochastic error AA, s toch —0(1/ VAO on a given configuration. We define the 
configuration average (-) c over n uncorrected configurations and normalize this appropriately: 

< s toch == ■ (8) 

For large N and n this will scale like cr 2 A stoch oc (Nn)~ l . We also define the total error cr A tot as 
the variation of the obtained estimates of A over gauge configurations. At fixed N this will be 
proportional to 1 / yfn. Obviously, this total error is limited by the stochastic erroiQ: 

(9) 

If cr 2 A stoch cr 2 A tot then it is worthwhile to improve the quality of the estimates while if cr 2 A s(och <sc 
cr 2 A t then precision can only be gained by increasing the number of configurations «, possibly 
reducing N to save computer time since the rC x scaling cancels from the above inequality. 

To minimize the stochastic noise at a given computational effort, we combine a multitude of 
techniques: 



1. partitioning (also known as the spin explicit method or as dilution) [18,21, 22] 

2. the hopping parameter expansion (HPE) technique 1E3II24II25I1 . 

3. the truncated solver method (TSM) lfl2ll and 

4. the truncated eigenmode acceleration (TEA) J26, 25. 221. 



These methods are explained below. 
2.1. Partitioning 

We decompose % — V ® colour ® spin into m subspaces Hf. % = 8ff =1 ( Rj- One can then set 
the source vectors \rf.) to zero, outside of the subspace "Rj. We label the corresponding solutions 
as \s'j). The solution for the all-to-all propagator is then given by the sum, 

m 

QT^^vm. (10) 

7=1 

This procedure results in an m-fold increase in the total number of solver applications. The term 
1 _ \n){n\ multiplying Q~ l in Eq. (0 only has off-diagonal entries, all of similar sizes. Therefore 
the off-diagonal entries of Q l will determine the stochastic error of the particular observables 
in question. Q l will exponentially decay with the spacetime distance between source and sink: 
spacetime components in the neighbourhood of the sink position will yield the leading contribu- 
tions to the stochastic variance. Likewise the noise for spin components that are strongly coupled 
to each other by Q l , multiplied by the relevant T and derivative structures, will dominate. Ide- 
ally, partitioning will black out the largest contributions to the stochastic noise. If the achieved 
reduction exceeds the 1 / -\Jm factor, then the computational overhead is justified. This overhead 



For an exact calculation of A (o"A,stoch = u )> a Monte Carlo error cr^ stat = o"A tot x 1 / V" can be introduced. In the 

2 - it 2 - rr 2 1 
A.stat "AJU A.sloch 



limit of large N and n, the factorization a 2 — "- 2 — 



need to introduce this quantity. 



can be somewhat reduced at the expense of memory by sophisticated preconditioning techniques 
and/or aggressive deflation |27, 28[ 22l HI 31, 321: the constant set-up costs of such techniques 
become more affordable with a larger number of right hand sides. We experimented with various 



patterns and combinations of colour, spin and time partitioning, see ej* 113 311 . Partitioning as a 



stand-alone solution works very well in many situations J2ll [34L 20l 221. However, such gains 
are mostly eliminated once partitioning is combined with the other tricks. Time partitioning is a 
notable exception: in situations where only timeslice-to-all propagators are required there is no 
increase in the number of solves but the variance is still reduced. In the calculation of discon- 
nected contributions to the nucleon structure it also turns out to be useful to generate the current 
insertion at more than one timeslice, e.g. to average the correlation with a nucleon propagating 
from a given source in the forward direction and the backward propagating one (which comes 
for free). In this case one can seed the random sources at two (or more) well separated timeslices 
and still gain from partitioning, without any associated overhead. 

2.2. The hopping parameter expansion technique 

The stochastic noise from terms that are close to the diagonal is accompanied by larger am- 
plitudes than terms that are far off the diagonal, see Eq. O and the discussion in Sec. l2.1l above. 
Hence the cancellation of near-diagonal noise requires a comparatively larger number of esti- 
mates. The HPE aims at eliminating some of the near-diagonal noise contributions by exploiting 
the ultra-locality of the action. Thus it cannot be generalized for instance to the Neuberger ac- 



tion 13511. 



We rewrite the fermionic matrix as, 

2kM = 1-kD. (11) 
The HPE is based on the observation that one can expand, 

00 k— 1 

AT 1 = 2k ^] (*£>)' = 2k ^] (*£>)' + (*£>)* AT 1 , (12) 

;=o i=o 

where k > 1 . For distances between source and sink consisting of more than k links, the first term 
on the right hand side does not contribute since D only connects nearest spacetime neighbours. 
Therefore, = [(kD^M^ 1 ] vv for sufficiently large source and sink separations. However, at 
finite N, {E[M~ ']}.„, + {E^/d^AT 1 ]}.„(= {(KD/'EfAT 1 ])}^), where the variance of the latter 
estimate of M" 1 is reduced. This was for instance exploited in 12511 . In some cases additional 
powers of D can be gained due to the F structure of the creation and annihilation operators. 

Here we study closed loops, i.e. x = y. Obviously, only even powers of D contribute to 
Tr(M _1 r), where F e {1, y M , cr^, y^y 5 , y 5 }. We can write, Tr(M _1 l) = 2/rTr 1 + K k Tr(D k M~ l ), 
where the first term can be computed trivially. In the other cases, Tr(M _1 T) = icTx{D^M T). 
For the Wilson action, k — 8 for F e {7^75,75} and k — 4 otherwise. For the Sheikholeslami- 
Wohlert action 13611. k — 2. The lowest non-vanishing terms have been calculated analyti- 
cally 1 23 ] and can be computed and corrected for exactly (unbiased noise subtraction) 123, 37, 5ll. 



We limit ourselves to the highest vanishing order (k — 4 or k — 8 for the Wilson action), where 
such subtraction is not necessary. In this case the computational overhead of the HPE is small, 
such that this substitution can easily be combined with the TSM that we explain below. 
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2.3. The truncated solver method 

It has been noticed long ago lT38h that solvers typically converge to the correct result within an 
accuracy of the size of the stochastic error after a relatively small number of iterations. Practition- 
ers have therefore sometimes relaxed the requirement on the residual when solving for stochastic 
sources. This is not unproblematic since it introduces a systematic bias that will be invisible on 
one configuration but might very well affect the result obtained on a sample of, say, 200 con- 
figurations. We label the result obtained after n t solver iterations by k( n ) ), where omitting the 
subscript means convergence within numerical accuracy. We can now factorize: 

, Ni , Ni+N 2 

Q l - Etg- 1 ) := - J] l** w >tf I + TT X fyh - K„)>) tfl • (13) 

i=l i=JVi+l 

The above equation is an exact linear decomposition of Q~ l and the algorithm used to calculate 
both parts is well defined. Due to these properties and the fact that the I77') with i < N\ are uncor- 
related with those for i > N\, the resulting estimate is unbiased. Ideally, one will generate a large 
number N\ » N2 of relatively cheap estimates at small n t and then remove the bias by correcting 
with a small number N2 of expensive solutions to machine precision. This method can easily be 
combined with all the other methods described here. It is possible to tune the two parameters 
Ni IN2 and n t that enter the algorithm to the point of minimal variance at fixed computer time in 
a relatively inexpensive and straightforward way. This is discussed in | Appendix A| 

In some sense the underlying philosophy of TSM is similar to estimating the cheap first term 
within the HPE Eq. (TTZb with many random sources and the expensive second term with only 
a few sources. However, iterative solvers like the CG converge faster than the HPE. Moreover, 
TSM is applicable to any fermion action, not only to ultra-local ones, and efficient for any quark 
mass. TSM can be combined with HPE, see Sec. l2.5l below. 

To check our implementation of the method we compared the exact result for (M -1 )™ 1 '* 202 , 
where s\Cy denotes the spin and colour indices, x = (0, 0, 0, 3) and y = (/, 0, 0, 3), i = . . . 10, 
with an estimate obtained from Eq. ( fT3b . We indeed find consistency within errors for different 
« t , Ni and N2. For example, for n, = 5, Ni = 5500, N2 - 300, i = 1, Si = S2 = 1, c\ = c% = 2, 
E[M;,i] = (0.0300(7), -0.0014(7)), compared to the exact result of (0.0302 . . . , -0.0010 . . .). 



In Appendix A we also demonstrate that the TSM can be generalized to other solvers, by 
employing BiCGstab2. However, the smooth convergence of CG, that is relatively independent 
of the random source and gauge configuration, is a clear advantage when it comes to deciding on 
a TSM parameter set. Moreover, in the CG case the combination with the truncated eigenmode 
acceleration bears less computational overhead and is compatible with even-odd preconditioning 
(see below). We emphasize however that the converged solutions \s l ) within Eq. (TT31 ) can be 
obtained by using any efficient solver. In particular, there is no need to employ the same solver 
as that used for obtaining the truncated solutions \sV The truncated solver of course needs to 
be the same for i > N\ as that used for i < N\ . 



2.4. The truncated eigenmode acceleration 

We define the n e smallest (real) eigenvalues q, of Q and the corresponding orthonormal eigen- 
vectors \uj), i — l,...,n: 

Q\ui) = qi\ui) , (ui\uj} = 6ij. (14) 

These can for instance be calculated by means of the parallel implicitly restarted Arnoldi method 
(IPvAM) with Chebychev acceleration ||39[] or by Ritz methods 114011 . 
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We can now approximate [39], 



Q l *J]W<«/I- (15) 
i=i 

However, this estimate is biased. We define the projection operator P,,., onto the complement of 
the space spanned by these n e eigenvectors, 

P„ = 1- 2 l".><"il • (16) 

Projecting the stochastic sources onto the orthogonal complement, 

QW x ) = \rf A ):=^n t \rf), (17) 
yields the new unbiased estimate ll25ll . 



Q l * E(Q- 1 ) := £ [udqfiUil + \s x )(r, x \ , (18) 
i=i 

where a left projection of \s i 1 ) is not necessary since [Q, P„J = and thus P„ e Q P„ e = 2 Pn e - The 
above estimate will usually have a reduced variance. Note that in the literature exact point- to-all 
propagators have also been combined with a truncated low eigenmode all-to-all calculation to 
achieve smaller gauge errors (low mode averaging) 114 lL 14211 . 

A nice side effect of TEA lies in the acceleration of the solver, by deflation ll43l I27I I29I [ill 



441 13211 . The condition number of the proj ected operator Q P„ c and therefore the number of solver 
iterations are reduced. Within some algorithms like the CG the fact that the projector commutes 
with the operator Q guarantees that the Krylov subspace remains confined within the orthogonal 
complement. So in these cases no further intermediate projections are necessary to fully exploit 
the potential of deflation. 

Note that while the number of undeflated solver iterations increases like 1 /w| s at small quark 
masses, the efficiency of the eigenvalue calculation remains the same. Unfortunately, at small 
mps, one would like to increase the linear lattice extent L oc 1/mps. In this case, the rank of the 
deflation space in the worst case will increase like n e oc L 4 oc Va 4 . Depending on the volume 
required this may or may not become a serious problem. The volume scaling of stochastic 
methods is somewhat more favourable: the number of estimates needs to be adjusted at most like 
VVa 2 (this can be somewhat reduced by partitioning). From these considerations it becomes 
evident that the optimal choices of n e and N strongly depend on the volume, the quark mass 
and the lattice spacing. The same holds for the algorithm where a deflated CG can outperform a 
(deflated) BiCGstab2, in particular when combined with the truncated solver method. 

Obviously, it is also possible to decompose M, rather than the Hermitian operator Q = 75 M 
of Eq. (fTST l. into eigenmodes [45]: M|r,) = /l,|r,). However, in this case we end up with a 
biorthonormal system ((i\rj) = Sjj, where the left eigenvectors \€i) will differ from the right 
eigenvectors: {ii\M = One can now decompose, 



M" 1 * £ WVil 



(19) 




Figure 1 : Connected and quark-line disconnected current insertion into the nucleon. 



It can easily be seen that due to the property M' = J5MJ5, A, :- A* is an eigenvalue whenever 
Aj is an eigenvalue, with a left eigenvector (€i\ = (r^ys and a right eigenvector |r,) = ysK,). The 
advantage of this decomposition is that the eigenvectors are independent of the quark mass and 
the eigenvalues at different k values are trivially related to each other. This follows from the 
structure Eq. (fTTT i. In this article we will not pursue this alternative eigenmode decomposition 
any further. However, the decomposition of M -1 might converge better in some channels than 
that of Q l and vice versa. While this biorthogonal approach is incompatible with deflating the 
CG solver, it would be the natural starting point for BiCG solvers. 

2.5. Combining the methods 

Partitioning can trivially be combined with the other three methods. However some notes 
are in place with respect to combinations of HPE, TSM and TEA. Within the HPE a left- 
multiplication of the estimate E[£T'] with D k is essential since we have defined M~ l = Q x ys- 
A right-multiplication would involve ysDjs = £>' . The easiest way of implementing HPE is 
to multiply the solution vectors with (i<D) k , prior to any application of smearing functions or 
contractions but after TEA and TSM. 

Within Eq. ([ToT i only the projected source vectors appear. Hence after the projection, Eq. ( fTTI ). 

\r} i D = W)-Yj(ujW)\u j ), (20) 

j=i 

the original noise vectors \t]') can be discarded. 

The overhead from the projection can be significant when combined with the TSM, Eq. (1131 1. 
where the large number of estimates Ni implies a large number of projections and the small 
number of solver iterations « t indicates that not much computer time will be spent between these 
projections. The projection overhead can be reduced by restricting the computation of the inner 
products to the partitioning subspace: many components of I77') might have been set to zero 
if the partitioning method was used. This saving cannot be obtained at the sink. Fortunately, 
[Pn ev >2] = 0, such that the solutions \s' ± ) remain within the orthogonal complement and no 
second projection is necessary, provided the residual of the solver is chosen sufficiently small! 
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Obviously the residual will not be small for the truncated solutions, which therefore one 
would wish to project back into the orthogonal complement (before application of the D opera- 
tor within the HPE). Fortunately, the (even-odd preconditioned) CG algorithm never leaves the 
orthogonal subspace: in this case, such an additional projection is never necessary, even if the 
solver is not run to convergence. Neither is such a left projection strictly required for solvers 
without this property, as long as the precision of the part that is run to convergence is sufficiently 
large. In this case, the Ni biased estimates will pick up some unwanted low mode contributions 
that will however be corrected for by the N2 estimates of the bias. This might or might not 
increase the stochastic errors. 

For completeness we mention that domain decomposition techniques have been suggested in 
the literature I46j.| 47 [|. These can easily be combined with TSM and TEA too, as can the so-called 
one-end-trick [48 , 49[ H • We remark that if a bosonic representation is employed l46l 51], 



estimating M' M - Q 2 instead of Q, then the TSM can in principle be substituted by a quasi 
heatbath strategy B52I1 . 



3. Evaluation of the disconnected loop 

3.1. Lattice setup 

We study combinations of the variance reduction techniques outlined above, using configu- 
rations provided by the Wuppertal group: these are tif = 2 + 1 dynamical configurations with 
V = 16 3 x 32 lattice points, generated using a Symanzik improved gauge action and a stout-link 
improved (rooted) staggered fermion action. The lattice spacing is fairly coarse, aT l * 1.55 GeV, 
while the spatial extent is around 2 fm. Further details can be found in ll53ll . For the valence 
quarks we use the Wilson action with k = 0.166, 0.1675 and 0.1684, corresponding to pseu- 
doscalar masses of about 600, 450 and 300 MeV, respectively. The analysis is performed on 
326 configurations at A-i oop = 0.166, 167 configurations at K\ mv = 0.1675 and 152 configurations 
at Kioop = 0.1684, where K\ oop refers to the k value of the disconnected loop. Our main results 
are obtained using the CG solver with even-odd preconditioning. However, results obtained 
with BiCGstab2 are given in |Appendix A| The code used throughout is a modified version of 
Chroma El El. 

3.2. Results for the disconnected loop 

We wish to calculate nucleon structure observables. For this purpose a nucleon will be created 
at some initial time fo and destroyed at a later time tf » to, with a current inserted at some 
intermediate time t. This is illustrated in Fig.Q] The disconnected loop Tr(M~'F) within the right 
diagram of the figure will be calculated with stochastic all-to-all techniques and can subsequently 
be combined with the nucleon two-point function, calculated in the standard point-to-all way, on 
a configuration by configuration basis. 

A preliminary picture of how well the variance reduction techniques work can be obtained 
by studying the zero-momentum projected disconnected loop XixTi^M^F) alone, where x = 
(x, t). With the exception of T = 1, expectation values of these loops over many configurations 
vanish, due to the discrete charge and parity symmetries of QCD. However, the expectation 
value of the correlation between loop and proton (with a momentum injected or with a specified 
helicity) can be non-zero. Likewise, with the exception of trivial cases such as lmTr(M~ l y5) = 
0, the loops will not vanish on single configurations. Using the Euclidean spacetime convention 
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Figure 2: Truncated estimates of the zero momentum projected Tr(M~'r), obtained after « t CG solver iterations for 
r = 1 (left) and for T = 7375 (right) at Ki oop = 0.166. The results are averaged over 300 stochastic sources. Horizontal 
lines indicate the result with statistical errors at convergence. 



{7m'7v} = 2<5/iv> 75 = j\j2jT,jA, we are interested in evaluating ReTr(M 'l), ImTr(M 1 y ll ), 
ImTrfM-V^y) = ReTrCM-'y^y,,) for pi + v, ReTr (Ar'y^ys) and ImTr (AT'ys)- 

We investigate the reduction in computer time of optimized stochastic estimates, relative to 
the situation without the use of any improvement techniques (except for time partitioning: all 
disconnected loops are calculated on one timeslice only). We state all costs in terms of the 
average real computer time required on a Pentium 4 PC for one solver application (unimproved 
estimate), where we account for all overheads of the improvement methods. We also ran the 
parallelized code on IBM p6 and Opteron clusters as well as on BlueGene/L and BlueGene/P 
systems. The overall efficiencies of matrix-vector multiplies and global sums depend on the 
architecture, however, the gain factors defined below are not significantly affected. 

We define the gain as the ratio, 

. var{E [Tr(M- 1 r)]} 
var{E imp [Tr(M T)]} 

taken at fixed real computer time, var denotes the variance between stochastic estimates on 
a given gauge configuration, Eo and Ei mp stand for the unimproved and improved estimated, 
respectively. 

We start by investigating the TSM at the heaviest quark mass, K\ oop = 0.166. Within TSM the 
combination Tr(M~'F) is obtained as an average over N\ truncated solutions \s'^ n .): (^'lysT^ ,). 
This value is then corrected by N2 Ni estimates of the resulting bias, see Eq. ( fT3l ). The faster 
the truncated estimate as a function of n t approaches the estimate obtained at full convergence, 
the better the method will work. In this study we only consider currents of local quark bilinears, 
where we use the F conventions of l54Tl . We observe very satisfactory convergence rates for all 
the 16 possible F structures. In Fig.|2]we illustrate this for the even/odd preconditioned CG solver 
for the worst case (r = 1) and for the best case (r = 7375). Note, however, that the unimproved 
estimate of F = 1 is already very precise to start with. Convergence at this k value is reached 
after « conv a 480 CG iterations. 

Applying TSM involves making choices for n t and for N\ /N2. We detail our optimization pro- 



cedure in Appendix A This systematic tuning results in similar amounts of computer time being 
spent on the truncated estimate as on estimating the bias. In particular we find N\ /N2 ~ n conv /n t . 
Performing this optimization on a single configuration appears sufficient since the variance is not 
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E imp [Tr(M-T)] 
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Table 1 : Optimized TSM values for n t and jVi /N2 , see Eq. I13K at k = 0. 1 66 for a subset of the Ts studied, calculated on 
one configuration using 300 estimates. The approximate gains obtained at fixed cost, using these values, are also shown. 
Where our method is combined with the HPE technique, k indicates the order used. 
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Figure 3: Estimates of the zero momentum projected Tr[(/cD)*M _ 'r] at /q oop = 0.166 for T = 1 (left) and T = 7375 
(right). The errors are obtained from 300 estimates on one gauge configuration and the zero-order HPE contribution for 
r = 1 was calculated explicitly. 




much affected if n t and N1/N2 are changed by 20 %. Fluctuations between configurations turn 
out to be smaller than this value. 

In TableQ] we display the optimized parameters at K\ oop = 0. 166 for a representative choice of 
F structures (scalar, vector, tensor, pseudoscalar and axial vector), together with the approximate 
fixed cost gains, Eq. (fJTT i. These factors vary between 5 and 10. In a real production run one 
would not wish to generate new sets of optimized estimates for each F structure or observable one 
is interested in. Fortunately, with the still tolerable exception of T = 1, n t and Ni /N2 exhibit only 
a mild F dependence such that similar overall gains can still be achieved with just one parameter 
set. 

TSM can trivially be combined with the hopping parameter expansion. At Ki oop - 0.166 
the gains from applying a stand-alone HPE are almost as big as those from the TSM. However, 
the HPE does not work well for r = 1, see Fig. [3] Both methods aim at removing noisy short 
range contributions from the estimates. Clearly, the gain from combining the two methods will 
be smaller than the product of the two isolated gains. However, the separation of long and short 
range is organized differently. The HPE only works for ultra-local actions and explicitly removes 
terms up to a particular lattice hopping radius. It will be less convergent and hence less effective at 
very light quark masses. The TSM is more generally applicable and separates short range modes 
from long range ones, but this is done within the Krylov space of the solver. When combining 
HPE with TSM, the reduced variances of the truncated estimates result in larger optimal n x values. 
The constant computational overhead per solve of the additional D applications also pushes n t 
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Table 2: The TSM gains without and in combination with the HPE at different quark masses. 
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Figure 4: Squared residual of the solver, as a function of the number of CG iterations. P„ cv denote the outcomes, after 
deflating the n cv lowest modes. 




towards larger values. Table Q] demonstrates that the combined gain of these two methods at 
"loop = 0.166 can be as large as a factor of 30. 

Having tested the method at K\ oop = 0.166, where mps ~ 600 MeV, we also use it to calculate 
the disconnected loop at K\ oop = 0.1675 and at K\ oop = 0.1684, corresponding to pseudoscalars 
masses of approximately 450 MeV and 300 MeV, respectively. The resulting gains are displayed 
in Table|2] While the TSM performance is very independent of the quark mass, the HPE becomes 
less effective at lighter masses, in agreement with our expectation. Still, even at the lightest quark 
mass, the combined gains range from factors of 6 (for F = 1) up to 19 (for T = 7375). 

At light quark masses where the HPE becomes less effective, the low eigenmode contribu- 
tions to hadronic observables might become more dominant J39L [41, 42, 25t] . To study this effect, 
we deflate the lowest eigenmodes at the lightest pseudoscalar mass. As expected, this accelerates 
the solver 1431 l26l l27l l28l l29l |3CX 13 lL l44l 13211 . We display a typical residual, as a function of the 
number of CG iterations, without deflation and deflating the lowest 5, 10, 20 and 50 modes of 
the Hermitian Wilson-Dirac operator in Fig. [4] 

The optimized TSM parameters at A-i oop = 0.1684 for combining TSM with HPE as well as 
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Table 3: The optimized TSM parameters and gains obtained at K[ oop = 0.1684, combining TSM with HPE, with and 
without TEA (deflating 20 eigenmodes). The gain factors are normalized to the costs of generating 300 and 100 standard 
estimates, respectively. 
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Figure 5: The numbers of iterations to convergence. For the heavier two k values (left) the comparison is with 70 sources 
per configuration, for k = 0.1684 (right) we only utilize 10 sources. 



for combining TSM with HPE and the TEA of the lowest 20 eigenmodes are shown in Table [3] 
The faster rate of convergence of the deflated solver leads to smaller n t values and therefore 
to larger Ni /N2 ratios when TSM is combined with TEA. When normalized to the real cost of 
generating 300 unimproved stochastic estimates, all gain factors increase by more than a factor of 
two, and this in spite of one quarter of the time being spent on generating and projecting onto the 
eigenvectors. However, this factor can fully be attributed to the acceleration of the solver: while 
there exist observables where stochastic errors decrease when applying the TEA SEHEl, for 
the quark loops that we investigate here this is not the case, at least not at pseudoscalar masses 
above 300 MeV. We observe a break-even between TSM+HPE+TEA(P 20 ) and TSM+HPE when 
matching the cost of approximately 100 estimates, as can be seen in the last row of the table. In 
lattice simulations we will encounter gauge errors from the Monte Carlo time series, in addition 
to the errors from the stochastic estimates on single configurations, discussed here. This interplay 
is studied below. 



3.3. Configuration averages 

When calculating averages over configurations, the stochastic errors cr stoc h, that are defined 
in Eq. ([SJ, should be smaller than say half of the final Monte Carlo errors cr tot . However, making 
0"stoch unnecessarily small can be a waste of computer time that would be more wisely spent 
on analysing more configurations or realizing additional source positions. We start to study this 
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balance of errors by computing expectation values over configurations for the disconnected loops. 
In Sec. |4]below we will then calculate the three point functions that are of physical interest. 

We remark that the (zero momentum projected) gauge average (ReTr(M _I l)) c will have 
a non-trivial value, while (Re Tr(M~' y M y5)) c = 0. We first assess the impact of low mode 
deflation. In Fig. [5] we display the number of CG iterations until convergence n conv for several 
random sources on different gauge configurations. Without deflation, reducing the quark mass 
from raps ~ 600 MeV (k = 0.166, black symbols of the left figure) to m PS » 300 MeV (k 
0. 1 6X4. red symbols of the right figure), we find n conv to grow from about 500 to 1800 iterations. 
This approximately four-fold increase is consistent with the expected 1 /»tp S behaviour of the 
condition number of M (we solve for Q 2 = WM). Deflating 5 eigenmodes (P5) at k — 0.1675 
reduces the required number of iterations by one third (red versus blue symbols of the left figure) 
and deflating 20 modes at k = 0.1684 results in a two third reduction (red versus blue symbols of 
the right figure). We observe variations of n conv between and within configurations that increase 
with decreasing quark mass, as has been investigated quantitatively in [57]. Deflating the lowest 
modes vastly reduces the variance in n conv . Clearly, the TSM parameter tuning benefits from this 
stabilization. 

We now average our improved estimates over n configurations, where we employ TSM, HPE 
and, at the lightest k value, TEA with a projection onto the 20 lowest modes. In addition, we 
calculate up to 100 unimproved estimates on each configuration. The results as functions of 
the real computer time spent, in units of the cost of one unimproved estimate, i.e. of one solve, 
are displayed in Table [H for Y — 1 and Y = jjjs, where we average over the three possible 
/'-directions. The total errors cr tot are displayed in brackets, followed by their respective lower 
bounds, as given by the stochastic errors cr stoc h. 

For r = 1, even at the cost equivalent of six standard estimates, the gauge errors still domi- 
nate over the improved stochastic errors. Therefore, there is little point in exceeding this number. 
On the given ensemble, the same error can be obtained using 25 to 50 standard estimates, sug- 
gesting a five-fold saving in computer time. However, this is somewhat misleading since, by just 
increasing the number of gauge configurations by a factor of 1 .6, the same error can be obtained, 
employing six unimproved estimates. It should be noted that the error balance could in princi- 
ple look differently, once the loop is correlated with a nucleon propagator, within a three point 
function. Moreover, the stochastic error will become more relevant on large volumes. At the 
lightest k value, due to the overhead of setting up TEA, the cost of the improved estimates will 
always exceed the number 6. In fact, as can also be seen from Table [4] TEA only turns out to 
be a worthwhile enterprise for a cost equivalent bigger than 90. Hence, unless the eigenvectors 
have been generated anyway, for instance to enable low mode averaging 14111 for the two point 
functions or for the calculation of other current insertions, TEA is best omitted for Y — t. 

The picture for the axial current containing Y = jjjs is different: here at the cost equivalent 
of 100 standard estimates, the stochastic error still accounts for one quarter of the total error and 
the gain of applying the improved method is in all cases more than four-fold. This advantage 
should increase further at larger volumes. Note that the reductions of the stochastic errors alone, 
for the scalar and axial cases, are consistent with the results of Table [2] that were obtained on a 
single configuration. 

Based on these results, we decide to invest the cost equivalent of 100 even/odd preconditioned 
CG solves into the TSM/HPE/(TEA) estimation of the axial loop and of 6 CG solves into the 
estimation of the scalar loop. Note that for the calculation of a standard point-to-all baryonic 
two-point correlation function usually 12 such solves are required, and even more sources if a 
variational basis is used to optimize the creation operator. We display the resulting stochastic 
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Table 4: Monte Carlo and stochastic errors for the (TSM+HPE) improved and unimproved estimates of Re Tr (M~ 1 T). 
The stochastic errors have been normalized according to Eq. (8). For (q oop = 0.1684, 20 eigenmodes were deflated 
(TEA). The cost is in terms of the number of unimproved estimates. Within the asterisked rows, the calculation of these 
eigenvectors is not folded into the cost calculation. 
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Figure 6: The zero momentum projected estimates Ej rap [Re Tr(M _1 1)] (left) and 5 2; E; mp [ReTr (M~ l yjys)] (right) at 
k = 0.166, with stochastic errors A stocn , on different gauge configurations. The scalar loop was generated investing the 
cost equivalent of 6 CG solves. The axial loop was evaluated at the cost of 100 solves. The large error bars on the right 
of the figures correspond to unimproved estimates Eq. 



errors A stoc h on single configurations at K\ oop = 0.166 in Fig. [6] This graphical representation 
again makes it evident that even for this low-cost setting the stochastic errors are much smaller 
than the gauge noise. The larger error bars on the right of the figures correspond to unimproved 
A s toch values, obtained at the same cost. After application of the improvement methods, the 
stochastic errors are so small that nothing extra can be gained from increasing the number of 
stochastic sources (and solves) beyond these moderate values. 



4. Application to Aq Ais and to (N\qq\N)' lis 

4.1. Definition of the matrix elements 

We now apply our methods to the calculation of observables of phenomenological interest, 
namely of Aq and (N\qq\N). The contribution to the nucleon spin Aq is defined through the 
matrix element (in Minkowski space notation), 

Aq 

(N, slqy^qW, s) = IntuS^ — , (22) 

where m^ denotes the mass of the nucleon N and s M its spin (s£ = -1). Aq and (N\qq\N) are 
extracted from the ratios of three-point functions to two-point functions (at zero momentum): 

Re (rfxf"(tf) 2 X Tr (M^x, f; x, t)Y loop )) 

KM 

For the scalar matrix element we use, r2 pt = r unpo i := (1 + y^)/2 and ri oop = 1. For Aq we 
calculate the difference between two polarizations: F2 pt = yjy5F unpo i and Fi oop = 7/75, where 
we average over all three possible y'-orientations. The spin projection operators along the j- 
axis read, P^i = ^(1 + zy/ys), so that in this case, r2 pt = —i(Pf - PiW unpo \, where we have 
traded a factor i against taking the real part, rather than the imaginary part, of the nominator 
in Eq. (f23T >. The variance of the above expression is reduced by explicitly using the fact that 
ImTrCM" 1 !) = ImTr (AT V/ys) = 0. 
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Figure 7: The effective mass, Eq. (26), of the nucleon smeared at source and sink for k = 0. 166. The time t is displayed 
in lattice units a k 0.13 fm. The result of a fit to the time range 3 < f < 16 is shown as horizontal lines. 



The two-point function of the zero momentum projected proton with sink and source spinor 
indices a and (3 is given by, 



(24) 



where we have set to = 0. This can be constructed from standard point-to-all quark propaga- 
tors |0]. Note that for q — u,d there are additional connected contributions R coa , which we have 
not calculated. We combine the three K\ oop values with «2pt = 0.166 and 0.1675. 
In the limit of large times, ff » f » 0, in the axial case, 

(N,s\qjjy 5 q\N,s) 



R ms (t, f f ) + R con (t, f f ) 



2m a 



Aq . 



(25) 



The prefactor two comes from taking the difference between two opposite polarizations rather 
than fixing one polarization. We note that this result will be in a lattice scheme and still needs 
to be multiplied by a renormalization factor of 0(1), for a translation into the MS scheme. In 
the scalar case, (N\qq\N) is defined as the connected contribution only and can thus be obtained 
by subtracting <0|^|0> = - 2 X Re ( T r (M~ l (x, f; x, f)} from Eq. d23j. We will label the discon- 
nected contributions to these two matrix elements as, Aq dls and (N\qq\N) Als , respectively. In the 
case of the strange quark, As = As dls and (N\ss\N) = (N\ss\N) dls . 

4.2. Results of the calculation 

We calculate the disconnected loop on only one timeslice t = 3 » 0.38 fm/a. Having the 
operator inserted close to the source (to = 0) reduces the statistical errors but care must be taken 
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Figure 8: The matrix elements (NlqqlN) 6 ™ (left) and Aq As (right) for different values of tf at *ri oop = (Qpi = 0.166. 



that contributions from excited states are suppressed so that Eq. (123b still holds. Following [25], 
we use Wuppertal smearing on top of APE smeared links at the source and sink for both the three 
point and two point functions. In Fig. [7] we display the effective mass, 

-1 , / C2pt(0 \ 

wjv,eff(0 =a In — — — - , (26) 
\C2pt(* +1)/ 

as a function of the time. This shows that with our choice of smearing parameters the excited 
state contributions are not significant at t = 3. Hence, in the symmetric situation, tf = 6, such 
contributions should be negligible. This is even more so since the statistical error will be much 
bigger than that of the two point function at t = 3. The consistency of this assumption can be 
checked by varying tf > 4. 

In Fig. [8] we display finite time estimates of the matrix elements {N\qq\N) dls and Aq Als in the 
lattice scheme, for different final times tf at K\ oop = K2 pt = 0.166. This value approximately cor- 
responds to the mass of the strange quark. The stochastic estimates of the loops were generated 
using time partitioning (f = 3), TSM and HPE, at the cost of 6 even/odd preconditioned CG 
solves for the scalar and at the cost of 100 such solves for the axial current. Our analysis of the 
two point function above suggests that finite t systematics should be small, relative to the statis- 
tical errors at tf = 6. Indeed, all tf > 4 data are consistent with a constant. We conservatively 
quote the tf = 6 values as our final results. We realized all six K\ oop e {0.166, 0.1675, 0.1684} and 
K2pt £ {0. 166, 0. 1675} combinations where the lightest K\ oop value corresponds to mps ~ 300 MeV. 
The five combinations that are not shown result in the same general picture, with larger statistical 
errors. For the two point function we do not realize the lightest k value since, without low mode 
averaging, this turns out to be very noisy. 

In Table [5] we display improved and conventional stochastic estimates of the scalar matrix 
elements (obtained at to = 0, t = 3 and tf = 6 « 0.76 fm/a), for all our k combinations. The fixed 
cost reductions, due to TSM and HPE, in the relative errors are small and do by far not match the 
gain factors that we obtained in Sec.[3]for the disconnected loops alone. The noise is dominated 
by taking the correlation with the two point function. For the precision of the disconnected loop 
to be matched by that of the two point function, the latter needs to be evaluated for multiple 



source points, eventually in combination with low mode averaging [41]. We note that one does 
not encounter any computational overhead in calculating the disconnected loop at more than 
one timeslice. As long as these are sufficiently separated, the stochastic errors will not increase 

18 



*"ioop =0-166 





*2pt = 


0.166 


*2pt = 


0.1675 


cost 


(N\qq\N)?^ 


(N\qq\N)$ s 


(N\qq\N)?« p 


(N\qq\N)% s 


300 


1.57(43) 




1.90(54) 




200 


1.58(43) 




1.91(54) 




100 


1.62(43) 


1.54(45) 


1.95(53) 


1.92(57) 


50 


1.65(42) 


1.68(46) 


2.00(54) 


2.00(61) 


25 


1.62(42) 


1.65(51) 


1.99(54) 


2.15(71) 


12 


1.48(44) 


1.85(51) 


1.93(59) 


2.44(71) 


6 


1.36(44) 


1.79(62) 


1.81(57) 


2.19(79) 


*ioop = 0.1675 


300 


1.39(63) 




1.36(69) 




200 


1.38(63) 




1.34(69) 




100 


1.35(65) 


1.33(66) 


1.33(72) 


1.16(72) 


50 


1.47(67) 


1.50(68) 


1.44(74) 


1.29(74) 


25 


1.49(70) 


1.62(74) 


1.36(79) 


1.40(81) 


12 


1.33(73) 


1.91(76) 


1.24(85) 


1.64(85) 


6 


1.38(77) 


1.72(88) 


1.30(89) 


1.28(99) 


kioop = 0.1684 


300 


1.45(73) 




1.39(77) 




200 


1.44(73) 




1.38(78) 




100 


1.44(74) 


0.94(64) 


1.38(78) 


0.88(76) 


90 


1.43(74) 


0.92(63) 


1.36(78) 


0.87(75) 


50* 


1.41(74) 




1.30(78) 




25* 


1.50(74) 




1.40(78) 




12* 


1.61(74) 




1.51(77) 




6* 


1.60(74) 




1.52(76) 





Table 5: The disconnected contribution to the scalar matrix element in the lattice scheme, evaluated at different cost 
equivalents. At /q oop = 0.1684 TEA (P20) is employed. In the asterisked rows the costs of generating the eigenvectors 
were neglected. 
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Figure 9: (NlqqlN) 61 " (left) and Ag dls (right) as functions of the quark mass used in the disconnected loop (expressed in 
terms of af?ip S ). The open squares correspond to a proton with Ar2 pt = 0.1675, the filled squares to the heavier /Qpt = 0.166. 
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Table 6: Disconnected contribution to Aq in the lattice scheme, evaluated at different cost equivalents. At Ki oop = 0.1684 
TEA (P20) is employed. In the asterisked rows the costs of generating the eigenvectors were neglected. 

significantly. 

In Table [6] we display the same information as in Table [5] for Aq dls in the lattice scheme. In 
this case, at the cost equivalent of 100 estimates (90 estimates at /ci oop = 0.1684), the gains of 
applying TSM and HPE in terms of computer time, relative to time partitioning alone, are about 
ten-fold. The reductions in error are close to those displayed in Table|4]for the disconnected loop 
alone. This channel is not yet limited by the accuracy of the two point function but of course 
also in this case statistics could be increased by averaging over multiple baryon sources and over 
forward as well as backward propagation. 

In Fig. [9] we display our final tf = 6 results for the two matrix elements. In neither case do we 
observe any significant dependence on the valence quark mass, varying this from mps ~ 600 MeV 
down to 450 MeV, or on the loop quark mass, reducing mps ~ 600 MeV (strange quark mass) to 
m PS ~ 300 MeV. We find As = -0.020(1 1) at the heavier proton mass and As = -0.017(21) at 
the lighter mass value, while the scalar matrix elements are somewhat larger than one. Note that 
the lattice results presented here are unrenormalized. However, based on perturbative two loop 
results 1 58 ] , albeit with different sea quark and gluon actions, we would not expect As to change 
by more than a factor of 0.7 after a translation into the MS scheme. 

5. Conclusions and outlook 

A growing number of Lattice QCD applications requires all-to-all propagators. These are 
most efficiently estimated stochastically. We presented the novel truncated solver method (TSM) 
that typically reduces the computer time to achieve a given stochastic variance by an order of 
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magnitude, without introducing a bias. The gain factors of this method for different observables 
were demonstrated not to vary when changing the quark mass by a factor of four (600 Me V > 
»Jps > 300 Me V). The TSM is easy to implement and can be used for any fermionic action. 
The combination of TSM with different variance reduction techniques is straight forward and 
this reduces the stochastic variance even further. We studied in detail combinations of TSM, 
partitioning [18[], the hopping parameter expansion 112311 and low eigenmode deflation 1 26 , 25 1 . 

In realistic Lattice QCD simulations there are intrinsic errors from the Monte Carlo time 
series, in addition to the errors introduced by the stochastic estimation of the inverse of the 
fermionic matrix on individual configurations. We studied the interplay between these gauge 
and stochastic noises. After reducing the stochastic variance by a combination of methods, in 
our calculation of disconnected contributions to the nucleon structure, the gauge errors became 
the dominant sources of uncertainty. This means that we can afford larger stochastic errors 
and for instance increase the lattice volume, without having to increase the number of random 
sources. For instance, on our 2 fm lattices, even at a cost of only 6 CG solves, the stochastic 
error of the scalar matrix element (N\qq\N) dm is completely over-shadowed by its gauge error: in 
certain situations one can overdo the reduction of the stochastic noise. In this particular case the 
total error can more efficiently be reduced by increasing the number of nucleon sources on each 
configuration than by increasing the number of (improved) estimates. Also the determination of 
As will benefit from this. At present we are pursuing such an approach. 

Our result on the strangeness contribution to the nucleon spin As m is in agreement with 
the other recent direct calculation of this quantity |[59ll . However, we disagree with earlier, less 



precise studies that employed a summation method over t 1 60, 61, 62]. These suggested a value 
As as -0.12. In the case of the scalar matrix element our errors are still too large to state a 
meaningful number, in particular since chiral and infinite volume behaviours need to be studied. 
The techniques developed here are used by us in an ongoing study 116 311 at smaller lattice spacing, 
quark masses and larger volumes with Sheikholeslami-Wohlert sea and valence quarks. 
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Figure 10: f\ and fi (see Eq. 4A. It ) as functions of n t for T = 7375 at K[ oop = 0.166, using the CG solver (left) and the 
BiCGstab2 solver (right). 




Figure 1 1 : Results used for the calculation of the optimal n t and N[ /N2 values for r = 7375 at /q oop = 0. 166. In the left 
figure both sides of Eq. )A.3t are shown and in the right figure Eq. )A.4t . 



Appendix A. TSM parameter tuning for the CG and BiCGstab2 solvers 

The truncated solver method of Sec. l2.3l depends on two parameters, namely on the number of 
iterations for the inexact solves, n t , and on the ratio Ni /N2 of the number of inexact estimates over 
the number of estimates of the correction term, see Eq. ( fT3l ). These parameters need to be fixed, 
ideally, so as to minimize the variance of the estimate of the disconnected loop, E[Tr(M~'r)], at 
fixed cost. The estimates are uncorrected and for Ni,N2^> 1 the variance factorizes into, 

var[Tr(M- 1 D] = var^r^,,,/') + var(<7,|y 5 r( \s) - IW*) = + ^ . (A.1) 

where f\ and fj depend on n t and F. An example of the dependence on n t is shown in Fig. [10] 
for T = 7375 for the CG and the BiCGstab2 solvers. f\ is roughly independent of n t and fa 
decreases rapidly with increasing n t . This behaviour is expected since after a few iterations the 
first term of Eq. (TTTb contains most of the signal (and its error) while the second term (and its 
error) approaches zero. The results were generated on a single configuration at A-i oop = 0.166 
using 300 stochastic sources. Convergence is achieved after n conv « 480 CG iterations. 
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48 
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6.1 
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5.2 7.2 
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6.9 


6.7 
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Table 7: Gains obtained for various n t , setting N1/N2 = filfi, at Ari oop = 0.166, using the BiCGstab2 solver. 



The approximate cost is given by, 

cost ss N\n t + N 2 n com . (A.2) 

When TSM is combined with HPE and/or TEA, there are corrections to this formula which the 
reader can easily work out. Using Lagrange multipliers, we minimize the variance Eq. ( I A. 1 b at 
fixed cost, assuming f\ to be approximately independent of n t . This yields the optimal values, 

(A.3) 
(A.4) 



»° pt = 


1 


/1/2 


72 conv 


/ 2 ' 2 


N\ _ 


fi 


"conv 


N 2 




opt 



where / 2 ' = df 2 /dn t . 

The right hand sides of Eqs. dA.31 > and iAAi can be computed as functions of n t on one 
configuration, using a single set of stochastic sources. By finding the intersection between the 
curve /i/2/(»conv/ 2 ' 2 ) anc ^ n t> one can extract "° pt an d subsequently determine Ni/N 2 for this n t . 
Fig.[TT|illustrates this procedure for T = 7375 for the CG solver. We read off the values n° pt =3 18 
and N\ /N 2 ~ 30 from this figure, see Table [T] 

For our observables we find that when using the optimized TSM parameter values, the two 
variances within Eq. (IA.U are of similar sizes, f\ /Ny a f 2 /N 2 . It also turns out that the gain factor 
does not critically depend on Ni /N 2 . For instance, increasing this ratio from the optimal value of 
30 found for r = 7375 at n t = 1 8 to the equal cost value N\ /N 2 = f\ /f 2 « 35 will increase the final 
variance by just 3 %, which is statistically insignificant. This suggests an alternative criterium 
for fixing the parameters: scanning through n t , keeping N\ /N 2 = f\ /f 2 fixed, to determine the n t 
value with the smallest combined variance. We find that following this strategy reduces no gain 
factor by more than 2%, relative to the gain achieved using the optimal values, at k — 0.166, 
using the CG solver. 

It can be seen from Fig. [I0]that the convergence is no longer a smooth function of n t when 
using the BiCGstab2 solver, so that f 2 cannot be determined. The situation is even worse for 
r = 1. In the BiCGstab2 case we have to resort to the alternative method discussed above. 
Table [7] demonstrates that, using this approach, one can indeed find values n t and N\ /N 2 for 
the BiCGstab2 solver that give reasonable gains. The cost was kept fixed to correspond to 300 
BiCGstab2 solves to convergence. In this case n conv w 156 is by a factor three smaller than for 
the CG, however, each iteration is about twice as expensive. The best gain factors are 5.0 for 
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r = 1 and 7.3 for F = 7375 while for CG, using this method, we are able to achieve very similar 
factors of 4.9 and 7.9, respectively. 

The CG algorithm is more robust than BiCGstab2 and gives nearly optimal gains over a wider 
range of n t values. Note for instance the somewhat erratic behaviour in Table [7] of BiCGstab2 at 
« t = 10. Moreover, the optimal N1/N2 ratios come out rather large, due to tiny fi values, which 
turns BiCGstab2 less optimal when it is combined with HPE or TEA. Therefore, in the context 
of TSM, the CG solver is our preferred choice. However, it is possible to combine both solvers, 
using CG for the truncated solves and BiCGstab2 for running to convergence more efficiently. 
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